Graded task: Logistic regression
Your task is to create a logistic regression model and to predict whether the patient has 10-year risk of future coronary heart disease (CHD). The data is given in the sheet Heart_desease. The dataset is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts.
Requirements: 1. Create a logistic regression model and estimate predictions. 2. Interpret the model’s coefficients and odds ratios. 3. Assess the model with performance metrics.
Evaluation criteria: 1. How well the model creation in the task description is followed. 2. You can explain how you selected the best model. 3. You can interpret the model’s performance metrics and parameters. 4. You present fluently and clearly. The presentation should take 10-15 minutes. 5. Analytical approach to the problem.
Sample questions: 1. How have you selected your best model? 2. Why have you used the particular model’s performance metrics? 3. What is an odds ratio? 4. How do you interpret the regression model’s intercept? 5. How do you interpret the regression model’s coefficients?
Medical (history)
Target variable (desired prediction) - 10-year risk of coronary heart disease CHD (binary: “1” means “Yes”, “0” means “No”)
Abbreviations: - EDA : Exploratory Data Analysis - EV : Explanatory Variable (i.e. predictor)
library(tidyverse)
library(janitor)
library(emmeans)
library(dlookr)
library(flextable)
library(jtools)
library(GGally)
library(vcd)
library(ggstatsplot)
library(gridExtra)
library(caret)
library(kableExtra)
select <- dplyr::select
bd <- "/Users/leonardo/My Drive/turing_college/Modules/Linear_and_Logistic_Regression/"
df <- read_csv(paste0(bd,"Graded_task_Heart_disease_data.csv")) %>%
tibble %>%
janitor::clean_names()
# change appropriate variables to factor
df <- df %>% mutate_at(
c("male","education","current_smoker","bp_meds",
"prevalent_stroke","prevalent_hyp",
"diabetes","ten_year_chd"), factor
)
# df %>% glimpse()
# dlookr::diagnose(df) %>% flextable
To estimate the performance of the model on out-of-sample data, we split the whole dataset into a training set (60% - 2543 data points) and a test set (40% - 1695 data points).
set.seed(124)
N = nrow(df)
split_index <- sample(c("train","test"), N, replace = T, prob = c(0.6,0.4))
train <- df[which(split_index == "train"),] # train set
test <- df[which(split_index == "test"),] # test set
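A random split like the one above does not guarantee that the proportion of CHD cases is the same in both sets. A hypothetical alternative is `caret::createDataPartition` (caret is already loaded above), which stratifies the split by the outcome; sketched here on a toy outcome vector rather than the actual data frame:

```r
library(caret)

set.seed(124)
# toy outcome vector standing in for df$ten_year_chd
y <- factor(rep(c(0, 1), times = c(850, 150)))

# indices of a 60% training partition, stratified by class
idx <- createDataPartition(y, p = 0.6, list = FALSE)
train_y <- y[idx]
test_y  <- y[-idx]

# class proportions are (approximately) preserved in both sets
prop.table(table(train_y))
prop.table(table(test_y))
```

The same `idx` would then be used to subset the full data frame into train and test.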
We will check for outliers - and subsequently evaluate potential imputation for NAs - only on the training data, and blindly apply the resulting decisions to the test data. In a real-world situation, the test set can grow continuously, so we must make decisions based on the training data alone.
The evaluation for potential NA imputation will be carried out later, just before modelling. The reason for this is that carrying out NA imputation first would bias the EDA.
Observations: All the numerical values are within the norm, except for blood pressure. Specifically, the systolic blood pressure sys_bp is far too high in some participants, with about 4% having a bp > 180 mmHg (a level requiring immediate medical care).
However, in published papers on the Framingham studies these values are accepted as plausible. We will therefore only exclude values above 250 mmHg.
# train %>% select_if(is.numeric) %>% summary()
# boxplot(glucose ~ diabetes, data = train, main = "glucose ~ diabetes")
boxplot(sys_bp ~ bp_meds, data = train, main = "systolic blood pressure ~ bp_meds")
boxplot(sys_bp ~ prevalent_hyp, data = train, main = "systolic blood pressure ~ hypertension")
# percent of people with sys_bp > 180
pct_high_sys_bp <- round((train$sys_bp > 180) %>% sum() / nrow(train) * 100,2)
# xtabs(~ prevalent_hyp + bp_meds, data = train)
train <- train %>% filter(sys_bp <= 250)
test <- test %>% filter(sys_bp <= 250)
Initially, we carry out t-tests to discover which EVs exhibit (corrected) significant differences between people with and without risk of CHD.
Then we will also examine the potential correlations between variables. If the latter are substantial, we will evaluate the possibility of discarding some EVs due to collinearity.
(an exercise in fitting many models using purrr::map)
library(tidyr)
ttests <- train %>%
# select("ten_year_chd", "age", "glucose") %>%
select(-c("education","male","current_smoker","bp_meds",
"prevalent_stroke","prevalent_hyp", "diabetes")) %>%
na.omit() %>%
# pivot longer to prepare the nesting
pivot_longer(
cols = -ten_year_chd,
names_to = "variable", values_to = "value"
) %>%
group_by(variable) %>%
group_nest() %>%
# fit a model to each nested df
mutate(
ttmod = map(data, ~ lm(value ~ ten_year_chd, data = .x))
) %>%
# get the stats
mutate(
tidy = map(ttmod, broom::tidy)
) %>%
unnest(tidy) %>%
filter(term == "ten_year_chd1") %>%
# correct for multiple comparisons
mutate(adj_pval = p.adjust(p.value, method = "fdr")) %>%
# filter(adj_pval <= 0.01) %>%
relocate(adj_pval, .after=variable) %>%
arrange(adj_pval)
ttests %>%
select(variable, statistic, adj_pval) %>%
rename(`t value` = statistic) %>%
flextable() %>%
set_formatter(adj_pval = function(x) sprintf("%.2e", x))
variable | t value | adj_pval |
|---|---|---|
age | 12.0815563 | 1.01e-31 |
sys_bp | 11.9732840 | 1.73e-31 |
dia_bp | 8.2215602 | 8.85e-16 |
glucose | 6.5653995 | 1.28e-10 |
bmi | 5.6308260 | 3.22e-08 |
tot_chol | 4.0326391 | 7.59e-05 |
cigs_per_day | 2.1141771 | 3.96e-02 |
heart_rate | 0.8512159 | 3.95e-01 |
ggbetweenstats(
data = train,
x = ten_year_chd,
y = age
)
ggbetweenstats(
data = train,
x = ten_year_chd,
y = sys_bp
)
#
ggbetweenstats(
data = train,
x = ten_year_chd,
y = dia_bp
)
#
ggbetweenstats(
data = train,
x = ten_year_chd,
y = glucose
)
#
ggbetweenstats(
data = train,
x = ten_year_chd,
y = bmi
)
#
ggbetweenstats(
data = train,
x = ten_year_chd,
y = tot_chol
)
Exploring the correlation matrix of the numerical variables to detect potential collinearity
train %>%
# select(-c("education","male","current_smoker","bp_meds",
# "prevalent_stroke","prevalent_hyp", "diabetes")) %>%
select(ten_year_chd, age, sys_bp, dia_bp, bmi, glucose, tot_chol) %>%
na.omit() %>%
# mutate_at(vars(-ten_year_chd), ~ log(. + 1)) %>%
GGally::ggpairs(
aes(color = ten_year_chd),
upper = list(continuous = wrap("cor", size = 3)),
lower = list(continuous = wrap("points", alpha = 0.7, size = 1)),
diag = list(continuous = wrap("densityDiag", alpha = 0.5))
)
- age appears to be the most discriminative variable for ten-year CHD prediction. Other EVs that differ significantly (after FDR correction) between people with and without CHD risk are bmi, sys_bp, dia_bp, glucose and tot_chol.
- The systolic (sys_bp) and diastolic (dia_bp) blood pressures are correlated with each other - as expected. Since CHD is especially related to hypertension, we will only use systolic blood pressure.
- bmi and age are also correlated with blood pressure; however, they appear to be important variables for CHD and their correlation with blood pressure is mild, so we will keep them.
- The number of cigarettes per day (cigs_per_day) appears to have only a weak relationship with the group. A chi-square test pooling all smokers and non-smokers (below), however, fails to show a significant effect of smoking.
# Chi-square test to assess whether there is a significant difference in the
# risk of CHD between smokers and non-smokers
train %>%
mutate(smoker = ifelse(cigs_per_day == 0, 0, 1)) %>%
select(ten_year_chd, smoker) %>%
xtabs(~ ten_year_chd + smoker, data = .) %>%
chisq.test() %>% tidy() %>% kable(format = "html") %>%
add_header_above(c("Chi-Square CHD Risk ~ Smokers" = 4)) %>%
kable_styling()
| statistic | p.value | parameter | method |
|---|---|---|---|
| 1.526007 | 0.2167127 | 1 | Pearson’s Chi-squared test with Yates’ continuity correction |
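For reference (one of the sample questions asks "What is an odds ratio?"), the odds ratio can be computed directly from a 2×2 contingency table; a minimal sketch with illustrative counts, not the actual Framingham data:

```r
# Odds ratio from a 2x2 table (illustrative counts, not the real data).
tab <- matrix(c(900, 100,   # no CHD risk: non-smoker, smoker
                120,  20),  # CHD risk:    non-smoker, smoker
              nrow = 2, byrow = TRUE,
              dimnames = list(chd = c("0", "1"),
                              smoker = c("0", "1")))

odds_risk    <- tab["1", "1"] / tab["1", "0"]  # odds of smoking among at-risk
odds_no_risk <- tab["0", "1"] / tab["0", "0"]  # odds of smoking among not-at-risk
or <- odds_risk / odds_no_risk
or  # (20/120) / (100/900) = 1.5
```

An odds ratio of 1.5 would mean the odds are 50% higher in the exposed group; the chi-square test above assesses whether such a difference is statistically distinguishable from OR = 1.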
# train %>% select_if(is.factor) %>% summary
As before for the numerical variables, we will first assess the dependence between the risk of CHD and the other categorical variables, and then explore the reciprocal relationships among all categorical variables to assess the presence of collinearity.
library(ggstatsplot)
ggbarstats(
data = train %>% select(ten_year_chd, male),
x = male,
y = ten_year_chd,
label = "both"
)
ggbarstats(
data = train %>% select(ten_year_chd, prevalent_hyp),
x = prevalent_hyp,
y = ten_year_chd,
label = "both"
)
ggbarstats(
data = train %>% select(ten_year_chd, current_smoker),
x = current_smoker,
y = ten_year_chd,
label = "both"
)
The absence of a relationship with smoking is quite surprising. A further visualization of the relationship between CHD risk and the number of cigarettes per day confirms that the relationship is indeed weak and unclear.
# Suppress summarise info
options(dplyr.summarise.inform = FALSE)
train %>%
select(ten_year_chd, cigs_per_day) %>%
na.omit() %>%
mutate(cigs_factor = round(cigs_per_day/10,0)) %>%
group_by(ten_year_chd, cigs_factor) %>%
summarise(count = n(), .groups = "drop") %>%
group_by(ten_year_chd) %>%
mutate(prop = count / sum(count)) %>%
select(ten_year_chd, cigs_factor, prop) %>%
ggplot(aes(x = factor(cigs_factor), y = prop, fill = ten_year_chd)) +
geom_bar(stat = "identity", position = "dodge")
ggbarstats(
data = train %>% select(ten_year_chd, bp_meds),
x = bp_meds,
y = ten_year_chd,
label = "both"
)
ggbarstats(
data = train %>% select(ten_year_chd, education),
x = education,
y = ten_year_chd,
label = "both"
)
Graphically display the association between the binary categorical variables, including ten_year_chd, to assess the presence of correlated predictors.
library(vcd)
train_cat <- train %>% select_if(is.factor) %>% select(-c("ten_year_chd","education"))
pairs(
table(train_cat),
diag_panel = pairs_diagonal_mosaic(offset_varnames=-2.5), #move down variable names
upper_panel_args = list(shade = TRUE), #set shade colors
lower_panel_args = list(shade = TRUE)
)
Numerical variables Age (age), High blood pressure (sys_bp), body-mass index (bmi), glucose blood level (glucose), cholesterol blood level (tot_chol) appear to be involved in predicting the risk of CHD
Categorical variables Sex (male), hypertension (prevalent_hyp), blood pressure medications (bp_meds) and education (education) appear to be involved in predicting the risk of CHD.
There are widespread associations between numerical and categorical variables. This is not surprising for two reasons:
- the hypertension-related variables are associated by definition (sys_bp, prevalent_hyp, bp_meds);
- with age, the bmi increases, and both are likely to lead to increased blood pressure (sys_bp), which in turn can lead to hypertension and the related medications.
Despite this, we will enter all the selected variables in the model and assess the presence of collinearity using VIF.
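As a reminder of what VIF measures: it equals 1/(1 - R²) from regressing each predictor on all the others, so it quantifies how much of a predictor's variance is explained by its peers. A sketch on the built-in mtcars data (not the Framingham data):

```r
# VIF by hand: regress each predictor on the remaining ones and apply
# VIF = 1 / (1 - R^2). Values near 1 mean no collinearity; common rules
# of thumb flag values above ~5 or ~10.
vif_manual <- function(data) {
  sapply(names(data), function(v) {
    r2 <- summary(
      lm(reformulate(setdiff(names(data), v), response = v), data = data)
    )$r.squared
    1 / (1 - r2)
  })
}

vif_manual(mtcars[, c("disp", "hp", "wt")])
```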
library(gridExtra)
boxplot_tt <- function(response, predictor) {
tt <- t.test(train[[response]] ~ train[[predictor]], data = train)
res <- paste0("t = ", round(tt$statistic,2), " p = ", round(tt$p.value,4))
p <- train %>% na.omit() %>%
ggplot(aes(x = .data[[predictor]], y = .data[[response]])) +
geom_boxplot() +
labs(title = paste0(predictor," vs ", response),
subtitle = res)
return(p)
}
p1 <- boxplot_tt("age", "male")
p2 <- boxplot_tt("age", "prevalent_hyp")
p3 <- boxplot_tt("age", "bp_meds")
grid.arrange(p1, p2, p3, nrow = 1)
p1 <- boxplot_tt("bmi", "male")
p2 <- boxplot_tt("bmi", "prevalent_hyp")
p3 <- boxplot_tt("bmi", "bp_meds")
grid.arrange(p1, p2, p3, nrow = 1)
p1 <- boxplot_tt("sys_bp", "male")
p2 <- boxplot_tt("sys_bp", "prevalent_hyp")
p3 <- boxplot_tt("sys_bp", "bp_meds")
grid.arrange(p1, p2, p3, nrow = 1)
p1 <- boxplot_tt("glucose", "male")
p2 <- boxplot_tt("glucose", "prevalent_hyp")
p3 <- boxplot_tt("glucose", "bp_meds")
grid.arrange(p1, p2, p3, nrow = 1)
p1 <- boxplot_tt("tot_chol", "male")
p2 <- boxplot_tt("tot_chol", "prevalent_hyp")
p3 <- boxplot_tt("tot_chol", "bp_meds")
grid.arrange(p1, p2, p3, nrow = 1)
As a last step before fitting the models, we need to take care of NAs in these 6 variables. As mentioned in the beginning, we do this here in order not to bias the EDA.
In case of imputation, we will apply the same procedure blindly to the test set.
bmi is significantly different for males and females. There are only 12 values missing. We will impute the NAs with the median of each sex within the risk and no-risk groups.
glucose and tot_chol are significantly different especially between people with and without hypertension. We will impute the NAs with the median of each hypertension status within the risk and no-risk groups.
bp_meds is a very important variable and not easy to impute. The table below shows that the people for whom blood pressure (bp) medication was not recorded have (median) bp values compatible with people who have either high or low blood pressure AND do not take medications, which could justify imputation. However, since this variable is associated with drug use, we prefer not to take that risk, and will therefore discard people for whom it is not known whether they take medications.
# bmi ~ male
ggbetweenstats(
data = train,
x = male,
y = bmi
)
# glucose ~ hypertension
ggbetweenstats(
data = train,
x = prevalent_hyp,
y = glucose
)
# cholesterol ~ hypertension
ggbetweenstats(
data = train,
x = prevalent_hyp,
y = tot_chol
)
# bp_meds ~ prevalent_hyp
df %>%
select(sys_bp, prevalent_hyp, bp_meds) %>%
group_by(bp_meds, prevalent_hyp) %>%
summarise(
median_sys_bp = median(sys_bp),
count_prevalent_hyp = n()
)
## # A tibble: 5 × 4
## # Groups: bp_meds [3]
## bp_meds prevalent_hyp median_sys_bp count_prevalent_hyp
## <fct> <fct> <dbl> <int>
## 1 0 0 122 2891
## 2 0 1 150. 1170
## 3 1 1 165 124
## 4 <NA> 0 121 31
## 5 <NA> 1 154. 22
NA imputation on train and test set
# ------------------------- train --------------------------------------------
train <- train %>%
# impute the bmi as the median of males/females
group_by(ten_year_chd, male) %>%
mutate(bmi = ifelse(is.na(bmi), median(bmi, na.rm = T),bmi)) %>%
ungroup() %>%
# impute glucose and cholesterol level as the median of prevalent_hyp
group_by(ten_year_chd, prevalent_hyp) %>%
mutate(glucose = ifelse(is.na(glucose), median(glucose, na.rm=T), glucose)) %>%
mutate(tot_chol = ifelse(is.na(tot_chol), median(tot_chol, na.rm=T), tot_chol)) %>%
ungroup() %>%
# remove people where bp_meds is not known
filter(!is.na(bp_meds)) %>%
# remove people with no information about education level
filter(!is.na(education)) %>%
# # remove the mean from numerical variables
# mutate_if(is.numeric, ~ . - mean(., na.rm = TRUE)) %>%
# remove residual NAs
na.omit()
train %>%
select(
ten_year_chd, age, sys_bp, bmi, glucose, tot_chol,
male, prevalent_hyp, bp_meds, education
) %>%
diagnose %>%
flextable
variables | types | missing_count | missing_percent | unique_count | unique_rate |
|---|---|---|---|---|---|
ten_year_chd | factor | 0 | 0 | 2 | 0.0008093889 |
age | numeric | 0 | 0 | 39 | 0.0157830838 |
sys_bp | numeric | 0 | 0 | 213 | 0.0861999191 |
bmi | numeric | 0 | 0 | 1,099 | 0.4447592068 |
glucose | numeric | 0 | 0 | 121 | 0.0489680291 |
tot_chol | numeric | 0 | 0 | 227 | 0.0918656414 |
male | factor | 0 | 0 | 2 | 0.0008093889 |
prevalent_hyp | factor | 0 | 0 | 2 | 0.0008093889 |
bp_meds | factor | 0 | 0 | 2 | 0.0008093889 |
education | factor | 0 | 0 | 4 | 0.0016187778 |
# ------------------------- test --------------------------------------------
test <- test %>%
# impute the bmi as the median of males/females
group_by(ten_year_chd, male) %>%
mutate(bmi = ifelse(is.na(bmi), median(bmi, na.rm = T),bmi)) %>%
ungroup() %>%
# impute glucose and cholesterol level as the median of prevalent_hyp
group_by(ten_year_chd, prevalent_hyp) %>%
mutate(glucose = ifelse(is.na(glucose), median(glucose, na.rm=T), glucose)) %>%
mutate(tot_chol = ifelse(is.na(tot_chol), median(tot_chol, na.rm=T), tot_chol)) %>%
ungroup() %>%
# remove people where bp_meds is not known
filter(!is.na(bp_meds)) %>%
# remove people with no information about education level
filter(!is.na(education)) %>%
# # remove the mean from numerical variables
# mutate_if(is.numeric, ~ . - mean(., na.rm = TRUE)) %>%
# remove residual NAs
na.omit()
test %>%
select(
ten_year_chd, age, sys_bp, bmi, glucose, tot_chol,
male, prevalent_hyp, bp_meds, education
) %>%
diagnose %>%
flextable
variables | types | missing_count | missing_percent | unique_count | unique_rate |
|---|---|---|---|---|---|
ten_year_chd | factor | 0 | 0 | 2 | 0.001265823 |
age | numeric | 0 | 0 | 37 | 0.023417722 |
sys_bp | numeric | 0 | 0 | 201 | 0.127215190 |
bmi | numeric | 0 | 0 | 859 | 0.543670886 |
glucose | numeric | 0 | 0 | 102 | 0.064556962 |
tot_chol | numeric | 0 | 0 | 213 | 0.134810127 |
male | factor | 0 | 0 | 2 | 0.001265823 |
prevalent_hyp | factor | 0 | 0 | 2 | 0.001265823 |
bp_meds | factor | 0 | 0 | 2 | 0.001265823 |
education | factor | 0 | 0 | 4 | 0.002531646 |
It appears that there are two interesting questions:
How well is the risk predicted by generic factors alone? (such as age, bmi and blood pressure)
How much can we increase the prediction if we include other variables? Does the prediction decrease because of the correlation between numerical and categorical variables?
Since we allowed some correlated variables to enter the model, we will pay close attention to the VIF (variance inflation factor).
After the “best” model is determined by manual stepwise regression, we will compare it to an automatic stepwise regression carried out using the AIC criterion alone.
We start by fitting the “maximal” model, which includes all the provided EVs, to have a baseline estimate for the AUC of this model and of the impact of the correlation between variables we previously detected with EDA - for the latter, we will check the VIF.
Then we fit the full model, which includes all the variables we previously selected based on the EDA. This will show how well the hints gathered in the EDA are actually reflected in the significance of the coefficients. Specifically, compared to the “maximal” model.
We will then proceed to eliminate EVs whose coefficients are not significant or have a very small effect (in terms of odds ratio). At each step we evaluate the AIC and AUC to detect whether reducing the EVs increased or decreased performance.
Finally we will examine the confusion matrix and adjust the threshold for binary prediction in order to have maximum Sensitivity (more on this later).
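The automatic AIC-based stepwise search we compare against can be sketched with MASS::stepAIC; shown here on a stand-in logistic model built from mtcars, not the Framingham data:

```r
# Automatic stepwise selection by AIC (stand-in example on mtcars).
library(MASS)

# start from a "maximal" logistic model and let stepAIC add/drop terms
full     <- glm(am ~ mpg + hp + disp + qsec, family = binomial, data = mtcars)
step_fit <- stepAIC(full, direction = "both", trace = FALSE)

formula(step_fit)  # the AIC-selected subset of predictors
AIC(step_fit)      # never worse than the starting model's AIC
```

Note that stepAIC optimizes AIC alone; it does not look at out-of-sample AUC, which is why we also inspect the latter manually at each step.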
The coefficients are reported in two forms (at the expense of verbosity):
standardized and exponentiated (using jtools::summ()), to allow: (1) interpreting the intercept in terms of probability, (2) comparing the magnitude of the coefficients, (3) interpreting the coefficients as odds ratios.
original, non-standardized and non-exponentiated (using summary()), to allow calculating the probability of 10-year CHD given the actual values of the EVs for a specific person.
The intercept \(e^{\beta_0}\) is 0.11, meaning that women (the reference level, with average values of the continuous EVs) have a probability of about 10% of being at risk of CHD in ten years (\(p = \frac{e^{\beta_0}}{1 + e^{\beta_0}} = \frac{0.11}{1.11} \approx 0.10\)).
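The conversion from the exponentiated intercept back to a probability can be checked numerically (0.11 is the value reported by the model; plogis is base R's inverse-logit):

```r
# From exp(intercept) to the baseline probability of the event:
# p = exp(b0) / (1 + exp(b0)), i.e. the inverse-logit of b0.
exp_b0 <- 0.11            # exp(intercept) as reported in the model summary
b0     <- log(exp_b0)     # the intercept on the log-odds scale

p0 <- exp_b0 / (1 + exp_b0)
c(manual = p0, plogis = plogis(b0))  # both ~ 0.099, i.e. about 10%
```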
The AIC was similar across all examined models (range : 1887..1897), therefore this metric had a small weight in the choice of the final model. Similarly, all the models explained at most 17% of the total variance in the data, which is not a substantial amount.
The VIF was not critical, even in the maximal model, therefore the impact of correlated predictors was minimal or negligible.
The maximal model highlighted the contribution of gender, age, systolic blood pressure, and to a lesser extent the prevalence of hypertension and education. The AUC of this model was 0.703.
Our full model highlighted the same variables. In addition reducing the number of EVs increased the significance of most of the selected EVs (especially age and systolic blood pressure). However the AUC decreased to 0.695.
We then removed EVs with small or no effect from the full model, specifically blood pressure medications, bmi, cholesterol and education. This led to an AUC of 0.701, and all variables significant.
Further removing the EVs specifying whether the patient had hypertension did not substantially change the AUC (0.703).
Remarkably, this model contained only gender, age, systolic blood pressure and blood glucose level. That is, two common demographic variables, and two physiologic measurements which can easily be obtained with a blood pressure measurement and a simple blood exam.
Specifically, the two EVs which mostly increased the odds of CHD risk in 10 years were:
At this point it is very interesting to note three findings from this model:
While we initially deemed this the optimal model, running automatic stepwise regression (see results below) highlighted the importance of the number of cigarettes per day (cigs_per_day), which had completely escaped our EDA.
We therefore included this EV in our model as well. The effect of this predictor turned out to be similar to that of the blood glucose level, and it increased the AUC to 0.713.
Our final model therefore includes:
Afterwards, we evaluated the performance of our model on the test data. Using a default probability of 0.5 for binary prediction leads to a poor Sensitivity, in that only ~ 7% of the people with a real 10 year CHD risk (\(\frac{16}{16+219}\)) are predicted as being at risk.
Since our aim is to maximise the detection of people who are actually at risk - at the expense of inflating the number of false positives - we lowered the threshold for binary prediction to 0.1. This allowed us to reach a sensitivity of 84%.
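The trade-off behind this threshold choice can be illustrated on synthetic probabilities standing in for the model's phat: lowering the threshold monotonically raises sensitivity at the cost of specificity.

```r
# Sensitivity/specificity as a function of the decision threshold
# (synthetic data, not the actual model output).
set.seed(1)
actual <- rbinom(1000, 1, 0.15)  # ~15% positives, like the CHD prevalence
phat   <- plogis(qlogis(0.15) + 1.5 * actual + rnorm(1000))

metrics_at <- function(thresh) {
  pred <- as.integer(phat > thresh)
  c(threshold   = thresh,
    sensitivity = sum(pred == 1 & actual == 1) / sum(actual == 1),
    specificity = sum(pred == 0 & actual == 0) / sum(actual == 0))
}

t(sapply(c(0.5, 0.3, 0.1), metrics_at))  # sensitivity rises as threshold drops
```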
Finally, we hypothesized that the imbalance between Sensitivity and Specificity could be due to the high proportion of people not at risk in our dataset. Therefore, as a last test, we sampled an equal number of people with and without risk.
In this balanced-sample model the AUC increased to 0.738. Using a decision threshold of 0.3, the model correctly predicts 91.4% of the people with actual CHD risk, at the cost of a false positive rate of 70%. Raising the threshold to 0.4, the model still detects 81.3% of the people with actual CHD risk while correctly classifying about 50% of the people with no CHD risk.
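One simple way to build such a balanced sample is to down-sample the majority class; a sketch on a toy data frame standing in for train (the exact balancing procedure is not shown in this excerpt):

```r
# Down-sample the majority (no-risk) class to match the minority class.
set.seed(124)
toy <- data.frame(
  ten_year_chd = factor(rep(c(0, 1), times = c(850, 150))),
  x = rnorm(1000)
)

n_pos <- sum(toy$ten_year_chd == 1)
balanced <- rbind(
  toy[toy$ten_year_chd == 1, ],                       # keep all positives
  toy[sample(which(toy$ten_year_chd == 0), n_pos), ]  # sample as many negatives
)

table(balanced$ten_year_chd)  # 150 / 150
```

Down-sampling discards data, so an alternative worth considering is keeping the full training set and adjusting only the decision threshold, as done above.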
do_logistic_regression: a function that carries out the logistic regression, prints the summary and the ROC curve, and returns the estimated probabilities (phat), which we use later to build the confusion matrix.
do_logistic_regression <- function(EVs, train_data = train, test_data = test, thresh = 0.5) {
# Assemble the formula
formula_string <- paste(EVs, collapse = " + ")
formula_obj <- as.formula(paste("ten_year_chd ~", formula_string))
fit <- glm(formula_obj, family = "binomial", data = train_data)
# Print a summary
# summary(fit) %>% print()
if (length(EVs) > 1) {
jtools::summ(fit, confint = T, vifs = T, scale = T, exp = T, digits = 2) %>% print()
} else {
jtools::summ(fit, confint = T, vifs = F, scale = T, exp = T, digits = 2) %>% print()
}
# # Standard R summary
summary(fit) %>% print()
# Estimated probability values from the logistic model
phat = predict(fit, newdata = test_data, type = "response")
# ROC curve using pROC::roc
# https://www.youtube.com/watch?v=qcvAqAH60Yw&t=493s
par(pty = "s")
pROC::roc(
test_data$ten_year_chd ~ phat, plot = TRUE, print.auc = TRUE, legacy.axes = TRUE,
xlab="% False Positives", ylab = "% True Positives", col = "#377eb8", lwd = 4
)
# Confusion matrix and derived quantities - using caret::confusionMatrix
sprintf("\n Using a threshold for prediction = %.2f \n\n", thresh) %>% cat()
predicted <- ifelse(phat > thresh, 1, 0)
caret::confusionMatrix(
as.factor(predicted), # predicted
as.factor(test_data$ten_year_chd), # actual
positive = "1" # the positive class in our case is "1"
) %>% print()
return(phat)
}
Maximal model, including all the variables in the dataset. We use this only as a benchmark for the models informed by the EDA.
We observe elevated VIF values for sys_bp and dia_bp - as expected.
EVs <- train %>% select(-ten_year_chd) %>% colnames()
phat <- do_logistic_regression(EVs, thresh = 0.5)
## MODEL INFO:
## Observations: 2471
## Dependent Variable: ten_year_chd
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(17) = 259.91, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.17
## Pseudo-R² (McFadden) = 0.12
## AIC = 1890.45, BIC = 1995.07
##
## Standard errors: MLE
## -------------------------------------------------------------------------
## exp(Est.) 2.5% 97.5% z val. p VIF
## ----------------------- ----------- ------ ------- -------- ------ ------
## (Intercept) 0.11 0.08 0.15 -13.56 0.00
## male1 1.46 1.13 1.90 2.86 0.00 1.25
## age 1.68 1.46 1.93 7.33 0.00 1.32
## education2 0.72 0.54 0.97 -2.14 0.03 1.13
## education3 0.72 0.50 1.03 -1.79 0.07 1.13
## education4 0.95 0.65 1.40 -0.25 0.80 1.13
## current_smoker1 1.19 0.81 1.73 0.89 0.37 2.61
## cigs_per_day 1.19 0.99 1.42 1.90 0.06 2.74
## bp_meds1 1.46 0.82 2.59 1.28 0.20 1.11
## prevalent_stroke1 2.29 0.69 7.66 1.35 0.18 1.04
## prevalent_hyp1 1.44 1.04 2.00 2.21 0.03 1.94
## diabetes1 1.37 0.64 2.93 0.80 0.42 1.88
## tot_chol 1.02 0.90 1.15 0.31 0.76 1.07
## sys_bp 1.31 1.07 1.60 2.62 0.01 3.64
## dia_bp 1.01 0.84 1.21 0.06 0.95 2.92
## bmi 1.06 0.94 1.20 0.93 0.35 1.23
## heart_rate 0.92 0.81 1.04 -1.35 0.18 1.10
## glucose 1.16 1.02 1.32 2.23 0.03 1.89
## -------------------------------------------------------------------------
##
## Continuous predictors are mean-centered and scaled by 1 s.d. The outcome variable remains in its original units.
##
## Call:
## glm(formula = formula_obj, family = "binomial", data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8415 -0.5828 -0.4175 -0.2897 2.8226
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.4497344 0.8561698 -8.701 < 2e-16 ***
## male1 0.3800302 0.1329243 2.859 0.00425 **
## age 0.0608765 0.0083022 7.333 2.26e-13 ***
## education2 -0.3243521 0.1517706 -2.137 0.03259 *
## education3 -0.3327557 0.1860997 -1.788 0.07377 .
## education4 -0.0502180 0.1969542 -0.255 0.79874
## current_smoker1 0.1717930 0.1922979 0.893 0.37166
## cigs_per_day 0.0149836 0.0078998 1.897 0.05787 .
## bp_meds1 0.3755894 0.2932894 1.281 0.20033
## prevalent_stroke1 0.8294602 0.6158164 1.347 0.17800
## prevalent_hyp1 0.3669952 0.1660561 2.210 0.02710 *
## diabetes1 0.3126526 0.3897172 0.802 0.42241
## tot_chol 0.0004310 0.0014095 0.306 0.75980
## sys_bp 0.0121563 0.0046474 2.616 0.00890 **
## dia_bp 0.0004736 0.0078561 0.060 0.95193
## bmi 0.0141265 0.0151335 0.933 0.35058
## heart_rate -0.0068913 0.0050928 -1.353 0.17601
## glucose 0.0063839 0.0028683 2.226 0.02604 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2114.4 on 2470 degrees of freedom
## Residual deviance: 1854.4 on 2453 degrees of freedom
## AIC: 1890.4
##
## Number of Fisher Scoring iterations: 5
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##
## Using a threshold for prediction = 0.50
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1329 213
## 1 16 22
##
## Accuracy : 0.8551
## 95% CI : (0.8367, 0.8721)
## No Information Rate : 0.8513
## P-Value [Acc > NIR] : 0.3513
##
## Kappa : 0.1249
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.09362
## Specificity : 0.98810
## Pos Pred Value : 0.57895
## Neg Pred Value : 0.86187
## Prevalence : 0.14873
## Detection Rate : 0.01392
## Detection Prevalence : 0.02405
## Balanced Accuracy : 0.54086
##
## 'Positive' Class : 1
##
This is the “full” model with all the variables chosen based on the EDA.
No problems with collinearity.
EVs <- c(
"male", "prevalent_hyp", "bp_meds", "education",
"age", "sys_bp", "bmi", "glucose", "tot_chol"
)
phat <- do_logistic_regression(EVs, thresh = 0.5)
## MODEL INFO:
## Observations: 2471
## Dependent Variable: ten_year_chd
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(11) = 241.75, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.16
## Pseudo-R² (McFadden) = 0.11
## AIC = 1896.60, BIC = 1966.35
##
## Standard errors: MLE
## ----------------------------------------------------------------------
## exp(Est.) 2.5% 97.5% z val. p VIF
## -------------------- ----------- ------ ------- -------- ------ ------
## (Intercept) 0.11 0.09 0.15 -16.96 0.00
## male1 1.74 1.37 2.21 4.53 0.00 1.07
## prevalent_hyp1 1.41 1.03 1.94 2.13 0.03 1.87
## bp_meds1 1.54 0.87 2.70 1.49 0.14 1.09
## education2 0.73 0.54 0.98 -2.07 0.04 1.11
## education3 0.71 0.49 1.02 -1.84 0.07 1.11
## education4 0.96 0.65 1.40 -0.23 0.82 1.11
## age 1.61 1.42 1.84 7.15 0.00 1.20
## sys_bp 1.30 1.12 1.51 3.45 0.00 2.05
## bmi 1.04 0.93 1.17 0.67 0.50 1.15
## glucose 1.19 1.08 1.31 3.53 0.00 1.02
## tot_chol 1.02 0.91 1.15 0.32 0.75 1.06
## ----------------------------------------------------------------------
##
## Continuous predictors are mean-centered and scaled by 1 s.d. The outcome variable remains in its original units.
##
## Call:
## glm(formula = formula_obj, family = "binomial", data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0066 -0.5948 -0.4251 -0.2956 2.8764
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.4825499 0.7084404 -10.562 < 2e-16 ***
## male1 0.5549067 0.1225136 4.529 5.92e-06 ***
## prevalent_hyp1 0.3460362 0.1621075 2.135 0.032793 *
## bp_meds1 0.4291989 0.2874776 1.493 0.135442
## education2 -0.3119174 0.1506666 -2.070 0.038429 *
## education3 -0.3402578 0.1853927 -1.835 0.066456 .
## education4 -0.0446276 0.1951239 -0.229 0.819091
## age 0.0561477 0.0078555 7.148 8.83e-13 ***
## sys_bp 0.0119782 0.0034759 3.446 0.000569 ***
## bmi 0.0097577 0.0145474 0.671 0.502379
## glucose 0.0074586 0.0021109 3.533 0.000410 ***
## tot_chol 0.0004487 0.0013974 0.321 0.748166
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2114.4 on 2470 degrees of freedom
## Residual deviance: 1872.6 on 2459 degrees of freedom
## AIC: 1896.6
##
## Number of Fisher Scoring iterations: 5
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##
## Using a threshold for prediction = 0.50
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1331 220
## 1 14 15
##
## Accuracy : 0.8519
## 95% CI : (0.8334, 0.8691)
## No Information Rate : 0.8513
## P-Value [Acc > NIR] : 0.4892
##
## Kappa : 0.0837
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.063830
## Specificity : 0.989591
## Pos Pred Value : 0.517241
## Neg Pred Value : 0.858156
## Prevalence : 0.148734
## Detection Rate : 0.009494
## Detection Prevalence : 0.018354
## Balanced Accuracy : 0.526710
##
## 'Positive' Class : 1
##
Remove education, bp_meds, bmi, tot_chol since they have no effect on the model.
The AIC slightly decreases with respect to the full model.
# Remove education, bp_meds, bmi, tot_chol since they have
# no effect on the model
EVs <- c(
"male", "prevalent_hyp",
"age", "sys_bp", "glucose"
)
phat <- do_logistic_regression(EVs, thresh = 0.5)
## MODEL INFO:
## Observations: 2471
## Dependent Variable: ten_year_chd
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(5) = 232.31, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.16
## Pseudo-R² (McFadden) = 0.11
## AIC = 1894.04, BIC = 1928.92
##
## Standard errors: MLE
## ----------------------------------------------------------------------
## exp(Est.) 2.5% 97.5% z val. p VIF
## -------------------- ----------- ------ ------- -------- ------ ------
## (Intercept) 0.10 0.08 0.12 -21.69 0.00
## male1 1.78 1.41 2.25 4.81 0.00 1.03
## prevalent_hyp1 1.45 1.06 1.99 2.33 0.02 1.83
## age 1.67 1.47 1.89 7.94 0.00 1.13
## sys_bp 1.34 1.16 1.56 3.97 0.00 1.94
## glucose 1.19 1.08 1.30 3.55 0.00 1.01
## ----------------------------------------------------------------------
##
## Continuous predictors are mean-centered and scaled by 1 s.d. The outcome variable remains in its original units.
##
## Call:
## glm(formula = formula_obj, family = "binomial", data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1687 -0.5905 -0.4260 -0.3015 2.8247
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.663398 0.547542 -13.996 < 2e-16 ***
## male1 0.577359 0.120071 4.808 1.52e-06 ***
## prevalent_hyp1 0.372685 0.159882 2.331 0.01975 *
## age 0.060092 0.007570 7.939 2.04e-15 ***
## sys_bp 0.013409 0.003382 3.965 7.34e-05 ***
## glucose 0.007457 0.002098 3.553 0.00038 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2114.4 on 2470 degrees of freedom
## Residual deviance: 1882.0 on 2465 degrees of freedom
## AIC: 1894
##
## Number of Fisher Scoring iterations: 5
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##
## Using a threshold for prediction = 0.50
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1336 221
## 1 9 14
##
## Accuracy : 0.8544
## 95% CI : (0.8361, 0.8715)
## No Information Rate : 0.8513
## P-Value [Acc > NIR] : 0.378
##
## Kappa : 0.0842
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.059574
## Specificity : 0.993309
## Pos Pred Value : 0.608696
## Neg Pred Value : 0.858060
## Prevalence : 0.148734
## Detection Rate : 0.008861
## Detection Prevalence : 0.014557
## Balanced Accuracy : 0.526442
##
## 'Positive' Class : 1
##
Remove prevalent_hyp since it is only marginally significant
Note that the AIC increases only marginally (by less than 0.2%), while this model is much simpler: it requires only four variables, two of which (age and sex) are common demographic variables, and the other two can be gathered easily through a blood-pressure measurement and a blood test.
# Remove prevalent_hyp since it is only marginally significant
EVs <- c(
"male", "age", "sys_bp", "glucose"
)
phat <- do_logistic_regression(EVs, thresh = 0.5)
## MODEL INFO:
## Observations: 2471
## Dependent Variable: ten_year_chd
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(4) = 226.93, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.15
## Pseudo-R² (McFadden) = 0.11
## AIC = 1897.43, BIC = 1926.49
##
## Standard errors: MLE
## -------------------------------------------------------------------
## exp(Est.) 2.5% 97.5% z val. p VIF
## ----------------- ----------- ------ ------- -------- ------ ------
## (Intercept) 0.11 0.09 0.13 -23.91 0.00
## male1 1.80 1.42 2.28 4.91 0.00 1.03
## age 1.68 1.48 1.91 8.11 0.00 1.12
## sys_bp 1.50 1.34 1.68 7.14 0.00 1.14
## glucose 1.19 1.08 1.30 3.53 0.00 1.01
## -------------------------------------------------------------------
##
## Continuous predictors are mean-centered and scaled by 1 s.d. The outcome variable remains in its original units.
##
## Call:
## glm(formula = formula_obj, family = "binomial", data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1631 -0.5896 -0.4326 -0.3050 2.8734
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.262714 0.486132 -16.997 < 2e-16 ***
## male1 0.588170 0.119860 4.907 9.24e-07 ***
## age 0.061119 0.007540 8.106 5.25e-16 ***
## sys_bp 0.018488 0.002588 7.144 9.08e-13 ***
## glucose 0.007425 0.002104 3.528 0.000418 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2114.4 on 2470 degrees of freedom
## Residual deviance: 1887.4 on 2466 degrees of freedom
## AIC: 1897.4
##
## Number of Fisher Scoring iterations: 5
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##
## Using a threshold for prediction = 0.50
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1335 221
## 1 10 14
##
## Accuracy : 0.8538
## 95% CI : (0.8354, 0.8709)
## No Information Rate : 0.8513
## P-Value [Acc > NIR] : 0.4053
##
## Kappa : 0.0828
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.059574
## Specificity : 0.992565
## Pos Pred Value : 0.583333
## Neg Pred Value : 0.857969
## Prevalence : 0.148734
## Detection Rate : 0.008861
## Detection Prevalence : 0.015190
## Balanced Accuracy : 0.526070
##
## 'Positive' Class : 1
##
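The odds ratios in the exp(Est.) column above are simply the exponentiated model coefficients. As a quick sanity check (using the coefficients reported in the 4-variable model above; note that jtools rescales continuous predictors by 1 s.d. in its table, so only the factor male1 matches directly), we can also express a continuous predictor's odds ratio on a more interpretable scale, e.g. per 10 mmHg of systolic blood pressure:

```r
# Odds ratios are the exponentiated logistic-regression coefficients.
# Coefficients copied from the summary above (4-variable model).
b_male   <- 0.588170   # male1 coefficient (log-odds)
b_sys_bp <- 0.018488   # sys_bp coefficient (log-odds per 1 mmHg)

# OR for males vs females (matches the exp(Est.) column for male1):
or_male <- exp(b_male)

# For continuous predictors, a per-10-unit OR is often easier to communicate:
or_sys_bp_10 <- exp(10 * b_sys_bp)

round(c(or_male = or_male, or_sys_bp_10 = or_sys_bp_10), 2)
```

So being male multiplies the odds of 10-year CHD by about 1.80, and each additional 10 mmHg of systolic pressure multiplies them by about 1.20.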
Although our EDA did not reveal a substantial effect of smoking on the 10-year risk of CHD, the automatic stepwise regression (below) identified cigs_per_day as an important predictor.
It is unclear to me why this is the case. However, its effect on predictive performance appears relevant, so the last model also includes the number of cigarettes smoked per day. This choice is further supported by the fact that heavy smokers tend to under-report how much they smoke.
The AIC decreases with respect to the previous model, and there is no evidence of problematic collinearity.
# The stepwise regression - below - identified cigs_per_day as an important factor.
# I will therefore add it to the model.
EVs <- c(
"male", "age", "sys_bp", "glucose", "cigs_per_day"
)
phat <- do_logistic_regression(EVs, thresh = 0.5)
## MODEL INFO:
## Observations: 2471
## Dependent Variable: ten_year_chd
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(5) = 238.54, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.16
## Pseudo-R² (McFadden) = 0.11
## AIC = 1887.82, BIC = 1922.69
##
## Standard errors: MLE
## --------------------------------------------------------------------
## exp(Est.) 2.5% 97.5% z val. p VIF
## ------------------ ----------- ------ ------- -------- ------ ------
## (Intercept) 0.12 0.10 0.14 -23.23 0.00
## male1 1.57 1.22 2.01 3.53 0.00 1.16
## age 1.75 1.54 1.99 8.50 0.00 1.17
## sys_bp 1.51 1.35 1.69 7.19 0.00 1.14
## glucose 1.19 1.08 1.31 3.57 0.00 1.01
## cigs_per_day 1.23 1.09 1.38 3.44 0.00 1.20
## --------------------------------------------------------------------
##
## Continuous predictors are mean-centered and scaled by 1 s.d. The outcome variable remains in its original units.
##
## Call:
## glm(formula = formula_obj, family = "binomial", data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1053 -0.5899 -0.4330 -0.2982 2.8162
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.643579 0.504565 -17.131 < 2e-16 ***
## male1 0.449590 0.127385 3.529 0.000417 ***
## age 0.065845 0.007745 8.502 < 2e-16 ***
## sys_bp 0.018684 0.002597 7.194 6.29e-13 ***
## glucose 0.007547 0.002112 3.573 0.000353 ***
## cigs_per_day 0.017858 0.005184 3.445 0.000571 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2114.4 on 2470 degrees of freedom
## Residual deviance: 1875.8 on 2465 degrees of freedom
## AIC: 1887.8
##
## Number of Fisher Scoring iterations: 5
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##
## Using a threshold for prediction = 0.50
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1337 219
## 1 8 16
##
## Accuracy : 0.8563
## 95% CI : (0.8381, 0.8733)
## No Information Rate : 0.8513
## P-Value [Acc > NIR] : 0.3
##
## Kappa : 0.0987
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.06809
## Specificity : 0.99405
## Pos Pred Value : 0.66667
## Neg Pred Value : 0.85925
## Prevalence : 0.14873
## Detection Rate : 0.01013
## Detection Prevalence : 0.01519
## Balanced Accuracy : 0.53107
##
## 'Positive' Class : 1
##
Keeping the default prediction threshold of 0.5 identifies too few true positives. For this reason, we decided to maximise the sensitivity (TP / (TP + FN)), even at the expense of assigning a high risk of CHD to some people who are not actually at risk.
The rationale is that in this case the decision is not whether to perform open-heart surgery, but whether to start monitoring people who might be at high risk. It is therefore of paramount importance to include everyone who might be at risk, even if some of them will turn out not to be.
For all these reasons, the threshold for risk detection is lowered to 0.1.
As a consequence, over half (about 57%) of the people who are not at risk will be flagged as at risk; in exchange, the sensitivity rises to about 84%.
# The main aim is to use the model to identify the highest amount of true positives
# and minimize false negatives
EVs <- c(
"male", "age", "sys_bp", "glucose", "cigs_per_day"
)
phat <- do_logistic_regression(EVs, thresh = 0.1)
## MODEL INFO:
## Observations: 2471
## Dependent Variable: ten_year_chd
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(5) = 238.54, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.16
## Pseudo-R² (McFadden) = 0.11
## AIC = 1887.82, BIC = 1922.69
##
## Standard errors: MLE
## --------------------------------------------------------------------
## exp(Est.) 2.5% 97.5% z val. p VIF
## ------------------ ----------- ------ ------- -------- ------ ------
## (Intercept) 0.12 0.10 0.14 -23.23 0.00
## male1 1.57 1.22 2.01 3.53 0.00 1.16
## age 1.75 1.54 1.99 8.50 0.00 1.17
## sys_bp 1.51 1.35 1.69 7.19 0.00 1.14
## glucose 1.19 1.08 1.31 3.57 0.00 1.01
## cigs_per_day 1.23 1.09 1.38 3.44 0.00 1.20
## --------------------------------------------------------------------
##
## Continuous predictors are mean-centered and scaled by 1 s.d. The outcome variable remains in its original units.
##
## Call:
## glm(formula = formula_obj, family = "binomial", data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1053 -0.5899 -0.4330 -0.2982 2.8162
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.643579 0.504565 -17.131 < 2e-16 ***
## male1 0.449590 0.127385 3.529 0.000417 ***
## age 0.065845 0.007745 8.502 < 2e-16 ***
## sys_bp 0.018684 0.002597 7.194 6.29e-13 ***
## glucose 0.007547 0.002112 3.573 0.000353 ***
## cigs_per_day 0.017858 0.005184 3.445 0.000571 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2114.4 on 2470 degrees of freedom
## Residual deviance: 1875.8 on 2465 degrees of freedom
## AIC: 1887.8
##
## Number of Fisher Scoring iterations: 5
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##
## Using a threshold for prediction = 0.10
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 582 37
## 1 763 198
##
## Accuracy : 0.4937
## 95% CI : (0.4687, 0.5186)
## No Information Rate : 0.8513
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.121
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8426
## Specificity : 0.4327
## Pos Pred Value : 0.2060
## Neg Pred Value : 0.9402
## Prevalence : 0.1487
## Detection Rate : 0.1253
## Detection Prevalence : 0.6082
## Balanced Accuracy : 0.6376
##
## 'Positive' Class : 1
##
Confusion matrix and its derived metrics, calculated manually. The values are not shown; the chunk is kept only as a double-check of the output of caret's confusionMatrix() function.
# # Binary prediction from the estimated probability values
# thresh <- 0.1
# predicted <- ifelse(phat > thresh, 1, 0)
#
# actual_predicted <- tibble(
# actual = test$ten_year_chd,
# predicted = ifelse(phat > thresh, 1, 0)
# )
#
# cm <- actual_predicted %>%
# summarise(
# TP = sum(actual == 1 & predicted == 1),
# TN = sum(actual == 0 & predicted == 0),
# FP = sum(actual == 0 & predicted == 1),
# FN = sum(actual == 1 & predicted == 0)
# )
#
# cm %>% print
#
# TP <- cm$TP
# TN <- cm$TN
# FP <- cm$FP
# FN <- cm$FN
#
#
# accuracy <- (TP + TN) / (TP+TN+FP+FN)
# print(paste0("Accuracy = ", round(accuracy,4)))
#
# sensitivity <- TP / (TP + FN)
# print(paste0("Sensitivity / Recall / TPR = ", round(sensitivity,4)))
#
# specificity <- TN / (TN + FP)
# print(paste0("Specificity / TNR = ", round(specificity,4)))
#
# PPV <- TP / (TP+FP)
# print(paste0("Positive Predictive Value = ", round(PPV,2)))
#
# NPV <- TN / (TN+FN)
# print(paste0("Negative Predictive Value = ", round(NPV,2)))
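For reference, plugging the counts from the 0.10-threshold confusion matrix above (TP = 198, TN = 582, FP = 763, FN = 37) into the same formulas reproduces the metrics reported by caret::confusionMatrix():

```r
# Counts read off the confusion matrix at threshold 0.10 (positive class = "1")
TP <- 198; TN <- 582; FP <- 763; FN <- 37

accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # 0.4937
sensitivity <- TP / (TP + FN)                   # 0.8426 (recall / TPR)
specificity <- TN / (TN + FP)                   # 0.4327 (TNR)
ppv         <- TP / (TP + FP)                   # 0.2060 (precision)
npv         <- TN / (TN + FN)                   # 0.9402

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, ppv = ppv, npv = npv), 4)
```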
This stepwise regression leads to a much more complex model. It is important to note that the order of the predictors can change the final "best" model selected (based on AIC).
What matters most, however, is that cigs_per_day is identified as an important predictor, even though our EDA repeatedly suggested otherwise. The reasons for this remain to be explored; for the time being, we include it in our manually built model (see above).
library(MASS)
# The manually specified full model below is immediately overridden by the
# "~ ." formula, so only the all-predictor full model is actually used.
# fullModel = glm(
#   ten_year_chd ~ male + prevalent_hyp + bp_meds + age + sys_bp + glucose + tot_chol,
#   family = "binomial", data = train
# )
fullModel = glm(ten_year_chd ~ ., family = "binomial", data = train)
nullModel = glm(ten_year_chd ~ 1, family = "binomial", data = train)
step_fit <- stepAIC(
fullModel, direction = 'backward',
scope = list(upper = fullModel, lower = nullModel), trace = 0
)
step_fit %>% summary()
##
## Call:
## glm(formula = ten_year_chd ~ male + age + education + cigs_per_day +
## prevalent_stroke + prevalent_hyp + sys_bp + glucose, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0023 -0.5868 -0.4197 -0.2915 2.7940
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.655170 0.587371 -13.033 < 2e-16 ***
## male1 0.387895 0.129655 2.992 0.002774 **
## age 0.060750 0.007983 7.610 2.74e-14 ***
## education2 -0.335970 0.150240 -2.236 0.025337 *
## education3 -0.352595 0.184679 -1.909 0.056232 .
## education4 -0.038528 0.195294 -0.197 0.843606
## cigs_per_day 0.019113 0.005227 3.657 0.000256 ***
## prevalent_stroke1 0.969730 0.601408 1.612 0.106868
## prevalent_hyp1 0.372800 0.161896 2.303 0.021295 *
## sys_bp 0.013282 0.003412 3.893 9.92e-05 ***
## glucose 0.007787 0.002106 3.698 0.000217 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2114.4 on 2470 degrees of freedom
## Residual deviance: 1860.2 on 2460 degrees of freedom
## AIC: 1882.2
##
## Number of Fisher Scoring iterations: 5
step_fit %>% jtools::summ(confint = T, vifs = T, digits = 2) %>% print()
## MODEL INFO:
## Observations: 2471
## Dependent Variable: ten_year_chd
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(10) = 254.11, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.17
## Pseudo-R² (McFadden) = 0.12
## AIC = 1882.24, BIC = 1946.18
##
## Standard errors: MLE
## ----------------------------------------------------------------------
## Est. 2.5% 97.5% z val. p VIF
## ----------------------- ------- ------- ------- -------- ------ ------
## (Intercept) -7.66 -8.81 -6.50 -13.03 0.00
## male1 0.39 0.13 0.64 2.99 0.00 1.19
## age 0.06 0.05 0.08 7.61 0.00 1.23
## education2 -0.34 -0.63 -0.04 -2.24 0.03 1.09
## education3 -0.35 -0.71 0.01 -1.91 0.06 1.09
## education4 -0.04 -0.42 0.34 -0.20 0.84 1.09
## cigs_per_day 0.02 0.01 0.03 3.66 0.00 1.22
## prevalent_stroke1 0.97 -0.21 2.15 1.61 0.11 1.01
## prevalent_hyp1 0.37 0.06 0.69 2.30 0.02 1.85
## sys_bp 0.01 0.01 0.02 3.89 0.00 1.97
## glucose 0.01 0.00 0.01 3.70 0.00 1.02
## ----------------------------------------------------------------------
# Estimated probability values from the logistic model
phat_step_fit = predict(step_fit, newdata = test, type = "response")
# ROC curve
# pROC::roc(test$ten_year_chd ~ phat_step_fit, plot = TRUE, print.auc = TRUE)
par(pty = "s")
pROC::roc(
test$ten_year_chd ~ phat_step_fit, plot = TRUE, print.auc = TRUE, legacy.axes = TRUE,
xlab="% False Positives", ylab = "% True Positives", col = "#377eb8", lwd = 4
)
##
## Call:
## roc.formula(formula = test$ten_year_chd ~ phat_step_fit, plot = TRUE, print.auc = TRUE, legacy.axes = TRUE, xlab = "% False Positives", ylab = "% True Positives", col = "#377eb8", lwd = 4)
##
## Data: phat_step_fit in 1345 controls (test$ten_year_chd 0) < 235 cases (test$ten_year_chd 1).
## Area under the curve: 0.7063
Evaluate the predictive performance of the stepwise model
# Binary prediction from the estimated probability values
thresh <- 0.1
predicted <- ifelse(phat_step_fit > thresh, 1, 0)
actual_predicted <- tibble(
actual = test$ten_year_chd,
predicted = ifelse(phat_step_fit > thresh, 1, 0)
)
caret::confusionMatrix(
as.factor(predicted), # predicted
as.factor(test$ten_year_chd), # actual
positive = "1" # the positive class in our case is "1"
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 608 38
## 1 737 197
##
## Accuracy : 0.5095
## 95% CI : (0.4845, 0.5344)
## No Information Rate : 0.8513
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1304
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8383
## Specificity : 0.4520
## Pos Pred Value : 0.2109
## Neg Pred Value : 0.9412
## Prevalence : 0.1487
## Detection Rate : 0.1247
## Detection Prevalence : 0.5911
## Balanced Accuracy : 0.6452
##
## 'Positive' Class : 1
##
Our sample is very imbalanced. We therefore select an equal number of people with and without risk in both the train and the test set, to assess whether this improves model performance.
set.seed(9999)
min_train <- min(
nrow(train[train$ten_year_chd == 1,]),
nrow(train[train$ten_year_chd == 0,])
)
balanced_train <- train %>%
group_by(ten_year_chd) %>%
slice_sample(n = min_train)
min_test <- min(
nrow(test[test$ten_year_chd == 1,]),
nrow(test[test$ten_year_chd == 0,])
)
balanced_test <- test %>%
group_by(ten_year_chd) %>%
slice_sample(n = min_test)
# Full model with the selected variables
EVs <- c(
"male", "prevalent_hyp", "bp_meds", "education",
"age", "sys_bp", "bmi", "glucose", "tot_chol"
)
# Remove education, bp_meds, bmi, tot_chol since they have
# no effect on the model
EVs <- c(
"male", "prevalent_hyp",
"age", "sys_bp", "glucose"
)
# Remove prevalent_hyp since it is only marginally significant
EVs <- c(
"male", "age", "sys_bp", "glucose"
)
# The stepwise regression - below - identified cigs_per_day as an important factor.
# I will therefore add it to the model.
EVs <- c(
"male", "age", "sys_bp", "glucose", "cigs_per_day"
)
phat <- do_logistic_regression(
EVs,
train_data = balanced_train, test_data = balanced_test,
thresh = 0.4
)
## MODEL INFO:
## Observations: 756
## Dependent Variable: ten_year_chd
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(5) = 137.85, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.22
## Pseudo-R² (McFadden) = 0.13
## AIC = 922.19, BIC = 949.96
##
## Standard errors: MLE
## --------------------------------------------------------------------
## exp(Est.) 2.5% 97.5% z val. p VIF
## ------------------ ----------- ------ ------- -------- ------ ------
## (Intercept) 0.88 0.70 1.10 -1.15 0.25
## male1 1.38 0.99 1.92 1.89 0.06 1.12
## age 1.89 1.58 2.26 6.98 0.00 1.18
## sys_bp 1.48 1.24 1.77 4.26 0.00 1.11
## glucose 1.35 1.07 1.69 2.58 0.01 1.01
## cigs_per_day 1.43 1.20 1.71 3.99 0.00 1.20
## --------------------------------------------------------------------
##
## Continuous predictors are mean-centered and scaled by 1 s.d. The outcome variable remains in its original units.
##
## Call:
## glm(formula = formula_obj, family = "binomial", data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1032 -1.0034 -0.1717 1.0161 2.0910
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.238967 0.720868 -10.042 < 2e-16 ***
## male1 0.319629 0.169510 1.886 0.05935 .
## age 0.074519 0.010682 6.976 3.04e-12 ***
## sys_bp 0.015708 0.003688 4.259 2.06e-05 ***
## glucose 0.010113 0.003923 2.578 0.00994 **
## cigs_per_day 0.030757 0.007706 3.991 6.57e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1048.04 on 755 degrees of freedom
## Residual deviance: 910.19 on 750 degrees of freedom
## AIC: 922.19
##
## Number of Fisher Scoring iterations: 4
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##
## Using a threshold for prediction = 0.40
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 115 44
## 1 120 191
##
## Accuracy : 0.6511
## 95% CI : (0.6061, 0.6941)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 2.818e-11
##
## Kappa : 0.3021
##
## Mcnemar's Test P-Value : 4.727e-09
##
## Sensitivity : 0.8128
## Specificity : 0.4894
## Pos Pred Value : 0.6141
## Neg Pred Value : 0.7233
## Prevalence : 0.5000
## Detection Rate : 0.4064
## Detection Prevalence : 0.6617
## Balanced Accuracy : 0.6511
##
## 'Positive' Class : 1
##
Use the following model, estimated on the whole dataset, to predict the 10-year risk of CHD for a given input.
\[ logOdds_{10yr-CHD-risk} = -8.671 + 0.51*male + 0.061*age + 0.0173*sys\_bp + 0.00758*glucose + 0.0204 * cigs\_per\_day \]
\[ p = \frac{e^{logOdds}}{1 + e^{logOdds}} \]
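As a minimal sketch of how the two equations above combine (the helper name predict_chd_risk is mine; the coefficients are copied from the log-odds equation), the final model can be applied to a new input like this:

```r
# Convert the final model's log-odds into a 10-year CHD risk probability.
# Coefficients taken from the equation above (whole-dataset fit).
predict_chd_risk <- function(male, age, sys_bp, glucose, cigs_per_day) {
  log_odds <- -8.671 + 0.51 * male + 0.061 * age +
    0.0173 * sys_bp + 0.00758 * glucose + 0.0204 * cigs_per_day
  # Inverse logit: p = exp(logOdds) / (1 + exp(logOdds))
  1 / (1 + exp(-log_odds))
}

# Example: 55-year-old male, systolic BP 140 mmHg, glucose 80 mg/dL, non-smoker
predict_chd_risk(male = 1, age = 55, sys_bp = 140, glucose = 80, cigs_per_day = 0)
```

As expected, the predicted risk increases monotonically with age, blood pressure, glucose and cigarettes smoked.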
NB: for illustrative purposes only. Read and accept the following Disclaimer first.